2-back-propagation-algorithm-cover

Introduction

In the introduction, we introduced the Artificial Neuron and demonstrated how it could be deployed to recognise patterns. However, the single artificial neuron does have its limitations, the most well-known of which is named the ‘XOR Problem’. In the 1970s, a strong criticism was levied at the artificial neuron as a processing unit by mathematicians Marvin Minsky and Seymour Papert, who showed that no matter how the Perceptron model is programmed, it was incapable of simulating a basic digital exclusive OR (aka. ‘XOR’) function. This function was at the time relatively simple to perform using digital electronics. This criticism, alongside several additional strong arguments against ANNs, asserted by Minsky and Papert, led to a diminishing scope of research being undertaken into ANNs for many years.

The multilayer neural network

The XOR problem turned out to be relatively easy to overcome. In fact, before Minsky and Papert had even published their criticism, Russian mathematician Andrei Kolmogorov had already demonstrated that a three-layer network of Artificial Neurons, with sufficient neurons in each layer, could learn any known function that was achievable in any other processing medium.

The multilayer neural network

Figure 1: The multilayer neural network

Generally, it is multilayer artificial neural networks rather than single artificial neurons that are used today and they are capable of undertaking many complex tasks. The style of network illustrated above is generally referred to as a ‘Feed-Forward’ network as data is transferred in a single direction (from input through to output) or as ‘Multi-Layer Perceptrons’ (MLPs).

Worked example 1

A feed-forward network using the sigmoid activated Perceptron is illustrated below.

Inputs	Value	Weights	Value
x1	0.6	X₁N₁	0.2
x2	0.2	X₁N₂	0.4
		X₂N₁	0.3
		X₂N₂	0.7
		N₁N₃	0.2
		N₂N₃	0.5
Activation Function: $φ^{s i g} ({n e t}_{j}, α) = \frac{1}{1 + e x p (- α \cdot {n e t}_{j})}$ assume slope constant (α) = 1.

Figure 2: ANN- Study exercise 1

Given its input and weight parameters the output (O) from this network is calculated below:

Neuron N₁

The output from the summation block (net_N1) is calculated as follows:

The output from the activation function (O_N1) is calculated as follows:

${n e t}_{N 1} = (0.6 \times 0.2) + (0.2 \times 0.3) = 0.18$

$O_{N 1} = φ^{s i g} ({n e t}_{N 1}) = \frac{1}{1 + e x p (- 0.18)} = 0.545$

Neuron N₂

The output from the summation block (net_N2) is calculated as follows:

The output from the activation function (O_N2) is calculated as follows:

${n e t}_{N 2} = (0.6 \times 0.4) + (0.2 \times 0.7) = 0.38$

$O_{N 2} = φ^{s i g} ({n e t}_{N 2}) = \frac{1}{1 + e x p (- 0.38)} = 0.594$

Neuron N₃

The output from the summation block (net_N3) is calculated as follows:

The output from the activation function (O) is calculated as follows:

${n e t}_{N 3} = (0.545 \times 0.2) + (0.594 \times 0.5) = 0.406$

$O = φ^{s i g} ({n e t}_{N 3}) = \frac{1}{1 + e x p (- 0.406)} = 0.6$

Teaching a multilayer neural network to learn

Having established the basics of artificial neural network, we shall now consider some practical network implementations, which are normally distinguished by how they are structured, how they function, or how they are trained.

There are innumerable distinct ANN implementations that have been proposed over the years but within these, there are a small group that are considered ‘classic’ implementations. These are:

Feed-forward networks that use the Back Propagation (BP) Algorithm for learning,
Hopfield Networks,
Competitive Networks, and
Spiking Neural Networks.

We will consider each of these implementations in turn, starting with feed-forward networks that use the back-propagation algorithm for learning.

Supervised learning vs. unsupervised learning

Within the field of machine learning, there are two main classifications of learning mechanism: supervised and unsupervised. The main difference between these classifications is that in supervised learning the design has prior knowledge of what output to expect for a given input, this is known as a ground truth. Under this form of learning, a programmer can feed an ANN with a known input and assess its output against an expected output. This known input + expected output combination is known as a training pair. The goal of training in this scenario, is for the ANN to modify its transfer function to approximate the desired input/output relationship, which it establishes by processing many training pairs.

The basic supervised learning mechanism

Figure 3: The basic supervised learning mechanism

Alternatively, in unsupervised learning, the programmer does not know what output to expect for a given input. Under this form of learning, the learning mechanism is governed by a series of algorithms or rules that seek to distinguish and classify outputs based on structural patterns that it finds in similar inputs. The goal of training under this scenario is for the ANN to adapt its transfer function to identify input patterns and find those that are structurally similar.

an unsupervised learning mechanism

Figure 4: An unsupervised learning mechanism

Example of supervised learning

Step 1 - Networks that employ a supervised form of learning are commonly used for simple pattern recognition and mapping tasks, in which known inputs (patterns) are assigned to individual outputs to establish a set of training pairs, as illustrated:

Pattern 1

Assigning digital values

Known inputs assigned to expected outputs

Training Pair
I₁	1
I₂	0	O₁	1
I₃	0	O₂	0
I₄	1

Pattern 2

pattern 2 assigning digital values

Training Pair
I₁	0
I₂	1	O₁	0
I₃	1	O₂	1
I₄	0

Step 2 - These training pairs are then fed into a network such as this, which determine actual outputs (O₁ and O₂), which will be based upon the weights of the network:

A network with 4 inputs and 2 outputs

Figure 5: A network with 4 inputs and 2 outputs

Step 3 - The success of the network is then assessed by comparing the actual output (as derived in Step 2) against the expected output (as assigned in Step 1).

Step 4 - The ANN weights can then be modified using a training mechanism such that on future iterations the ANN will perform better at achieving the desired transfer function. Training can be repeated many times until the actual output of the ANN adequately matches the expected output. The back propagation algorithm is one such example of a training mechanism that could achieve this. We shall go on to describe how this algorithm works in the next section.

The back propagation algorithm

As noted in the previous section, the back propagation algorithm is implemented as a training mechanism within a supervised form of ANN. The purpose of the BP algorithm is to modify the forward pass transfer function of the ANN based on how successfully the network previously approximated the desired input/output relationship.

Once the network has been adequately trained, it will provide the desired output for any new input patterns, e.g. it will have taught itself how to identify patterns.

To summarise, the back propagation algorithm is operated as follow:

The first step is to initialise the network with randomly selected weights (often bounded to between -1 and +1).
Apply a known input (part of a training pair).
The output is calculated via a forward pass through the network.
- This is termed the ‘actual output’ and is likely not to be very accurate on the first pass as the network is initialised randomly.
The ‘actual output’ is compared against the expected output (aka. the ‘target’) and an error value calculated.
- $error (ε) = (O_{target} - O_{act})$
The error (ε) is then used to mathematically change the weights in such a way to reduce the error in future iterations.
- This is termed the ‘reverse pass’ through the network
This process is repeated until the resulting error has reduced to an acceptable value.

We shall illustrate this process in the following two worked examples.

Worked example 2

Consider the following network (assume the neuron is a sigmoid activated Perceptron):

Simple network - worked example 1

Figure 6: Simple network - Worked example 1

The following procedure should be followed.

Step 1 Initialise network with randomly selected weights.

$ω_{AN} = 0.4$

$ω_{BN} = - 0.3$

Step 2 Apply a known input (part of a training pair).

Assume the following Training Pair:

Inputs		Out
A	B	O_target
0.8	0.4	0.7

Step 3 Calculate output

${n e t}_{N} = (A \times ω_{AN}) + (B \times ω_{BN})$

${n e t}_{N} = (0.8 \times 0.4) + (0.4 \times - 0.3) = 0.2$

$O_{act} = φ^{s i g} ({n e t}_{N}) = \frac{1}{1 + e x p (- 0.2)} = 0.550$

Step 4 Calculate Error

Note: The ‘ $O_{a c t} (1 - O_{a c t})$ ’ is a squashing function that is required due to the use of the sigmoid activation function. If we were only using a threshold activation function, this would not be required.

$ε_{N} = O_{a c t} (1 - O_{a c t}) (O_{t a r g e t} - O_{a c t})$

$ε_{N} = 0.550 (1 - 0.550) (0.7 - 0.550)$

$e r r o r (ε_{N}) = 0.037$

Step 5 Modify weights based on the error.

Let $ω_{AN}^{+}$ be the modified (trained) weight of $ω_{AN}$ .

The back propagation algorithm is stated as:

$ω_{AN}^{+} = ω_{AN} + η (ε_{N} \times O_{A})$

Note ‘ $η$ ’ is the learning rate, nominally set to 1 but can be modified to increase or decrease the magnitude of change implemented on each training loop. We shall assume $η = 1$ for these examples.

$ω_{AN}^{+} = ω_{AN} + η (ε_{N} \times O_{act})$

$ω_{AN}^{+} = 0.4 + (0.037 \times 0.550)$

$ω_{AN}^{+} = 0.420$

$ω_{BN}^{+} = ω_{BN} + η (ε_{N} \times O_{act})$

$ω_{BN}^{+} = - 0.3 + (0.037 \times 0.550)$

$ω_{BN}^{+} = - 0.280$

Step 6 Repeat process to minimise the error.

Note: we shall only repeat the process once here to show that the error has decreased.

${n e t}_{N}^{+} = (A \times ω_{AN}^{+}) + (B \times ω_{BN}^{+})$

${n e t}_{N}^{+} = (0.8 \times 0.420) + (0.4 \times - 0.280) = 0.224$

$O_{act}^{+} = φ^{s i g} ({n e t}_{N}^{+}) = \frac{1}{1 + e x p (- 0.224)} = 0.556$

$ε_{N}^{+} = O_{act}^{+} (1 - O_{act}^{+}) (O_{target} - O_{act}^{+})$

$ε_{N}^{+} = 0.556 (1 - 0.556) (0.7 - 0.556) = 0.036$

It should be noted that $ε_{N}^{+} < ε_{N}$ and therefore in just one training loop, our network has improved its capability to approximate the desired transfer function. If more training were conducted this error would continue to decrease, i.e., the network would become ‘trained’.

Worked example 3

The following example extends our understanding of the back propagation algorithm by demonstrating how to integrate multiple sources of error into the algorithm that governs the training of a neuron.

Consider the following network (assume all neurons are sigmoid activated Perceptrons):

Simple network - worked example 2_1

Figure 7: Simple network - Worked example 2

Once again, the following procedure should be followed.

Step 1 Initialise network with randomly selected weights.

$ω_{AN1} = 0.2$ $ω_{N1N2} = - 0.6$

$ω_{BN1} = 0.4$ $ω_{N1N3} = 0.5$

Step 2 Apply a known input (part of a training pair).

Assume the following Training Pair:

Inputs		Outputs
A	B	O_{N2_target}	O_{N3_target}
0.3	0.9	0	1

Step 3 Calculate outputs

${n e t}_{N1} = (A \times ω_{AN1}) + (B \times ω_{BN1}) = 0.420$

$O_{N1_act} = φ^{s i g} ({n e t}_{N1}) = 0.603$

${n e t}_{N2} = (O_{N1_act} \times ω_{N1N2}) = - 0.362$

$O_{N2_act} = φ^{s i g} ({n e t}_{N2}) = 0.410$

${n e t}_{N2} = (O_{N1_act} \times ω_{N1N2}) = - 0.362$

$O_{N3_act} = φ^{s i g} ({n e t}_{N3}) = 0.575$

${n e t}_{N3} = (O_{N1_act} \times ω_{N1N3}) = 0.302$

Step 4 Calculate Errors

$ε_{N 2} = O_{N2_act} (1 - O_{N2_act}) (O_{N2_target} - O_{N2_act})$

$ε_{N 2} = 0.410 (1 - 0.410) (0 - 0.410) = - 0.099$

$ε_{N 3} = O_{N3_act} (1 - O_{N3_act}) (O_{N3_target} - O_{N3_act})$

$ε_{N 3} = 0.575 (1 - 0.575) (1 - 0.575) = 0.104$

$ε_{TOTAL} = |ε_{N2}| + |ε_{N3}| = 0.203$

Step 5 Modify weights based on the error.

Note that we shall be using a learning rate of $η = 1.$

$ω_{N 1 N 2}^{+} = ω_{N 1 N 2} + η (ε_{N 2} \times O_{N 2_act})$

$ω_{N 1 N 2}^{+} = - 0.6 + (- 0.099 \times 0.410) = - 0.641$

$ω_{N 1 N 3}^{+} = ω_{N 1 N 3} + η (ε_{N 3} \times O_{N 3_act})$

$ω_{N 1 N 3}^{+} = 0.5 + (0.104 \times 0.575) = - 0.560$

$ω_{A N 1}^{+} = ω_{A N 1} + η (ε_{N 1} \times O_{N 1_act})$

$ω_{B N 1}^{+} = ω_{B N 1} + η (ε_{N 1} \times O_{N 1_act})$

Note here that we require to know the error from neuron N1 ‘ $ε_{N 1}$ ’. This can be extrapolated by from the errors of N2 and N3:

$ε_{N 1} = (ε_{N 2} ω_{N 1 N 2} + ε_{N 3} ω_{N 1 N 3}) \times O_{N 1_act} (1 - O_{N 1_act})$

Note: The ‘ $O (1 - O)$ ’ is once again a squashing function that is required due to the use of the sigmoid activation function within neuron N1.

$ε_{N 1} = (ε_{N 2} ω_{N 1 N 2} + ε_{N 3} ω_{N 1 N 3}) \times O_{N 1_act} (1 - O_{N 1_act})$

$ε_{N 1} = ((- 0.099 + - 0.6) + (0.104 \times 0.5)) \times (1 - 0.603)$

$ε_{N 1} = 0.027$

$ω_{A N 1}^{+} = ω_{A N 1} + η (ε_{N 1} \times O_{N 1_act}) = 0.2 + (0.027 \times 0.603)$

$ω_{A N 1}^{+} = 0.216$

$ω_{B N 1}^{+} = ω_{B N 1} + η (ε_{N 1} \times O_{N 1_act}) = 0.4 + (0.027 \times 0.603)$

$ω_{B N 1}^{+} = 0.416$

Step 6 Repeat process to minimise the error.

Note: we shall only repeat the process once here to show that the error has decreased.

${n e t}_{N 1}^{+} = (A \times ω_{AN 1}^{+}) + (B \times ω_{BN 1}^{+}) = 0.439$

$O_{N 1_act}^{+} = φ^{s i g} ({n e t}_{N 1}^{+}) = 0.608$

${n e t}_{N 2}^{+}_{} = (O_{N 1_act}^{+} \times ω_{N 1 N 2}^{+}) = - 0.390$

$O_{N 2_act}^{+} = φ^{s i g} ({n e t}_{N 2}^{+}) = 0.404$

$ε_{N 2}^{+} = O_{N 2_act}^{+} (1 - O_{N 2_act}^{+}) (O_{N 2_target} - O_{N 2_act}^{+}) = - 0.097$

${n e t}_{N 3}^{+}_{} = (O_{N 1_act}^{+} \times ω_{N 1 N 3}^{+}) = 0.340$

$O_{N 3_act}^{+} = φ^{s i g} ({n e t}_{N 3}^{+}) = 0.584$

$ε_{N 3}^{+} =$ O N 3 _act + ( 1 − O N 3 _act + ) ( O N 3 _target − O N 3 _act + ) = 0 . 1 0 1

$ε_{T O T A L}^{+} = |ε_{N 2}^{+}| + |ε_{N 3}^{+}| = 0.198$

Once again, it should be noted that $ε_{T O T A L}^{+} < ε_{T O T A L}$ and therefore our network has improved its capability to approximate the desired transfer function, i.e. is becoming trained.

Considerations for training the BP algorithm

via Gfycat

How to train the ANN to recognise multiple input patterns

In the previous section we illustrated how to train the ANN to recognise and identify a given input pattern, however, it should be noted that such a network can be trained to recognise and identify multiple input patterns using the same algorithm. The technique to achieve this is quite simple; the training mechanism needs only be fed each input successively and to continue training until the error is minimised for all inputs. Suppose we want to train the network to identify the first three letters of the alphabet as illustrated in Figure 8:

Input patterens A-C

Figure 8: Input patterns for A – C

The process for training the network would be as follows:

Flowcahrt - training multiple inputs

Figure 9: Flow chart - Training multiple inputs

It should be noted that a common mistake is to attempt to fully train the network for one letter at a time before moving on to the next, i.e., training the ANN until the error is minimised when processing ‘A’ prior to beginning to train ‘B’. This technique fails as the network (trained to recognise ‘A’) would ‘forget’ as it is trained for the second letter independently.

When to stop training

It may seem that the question of when to stop training can be responded to with an obvious answer; training should stop when the network is able to perfectly recognise and correctly identify all input patterns. However, there are some additional considerations that may be required:

It may be desirable for the network to be able to identify noisy or distorted patterns, these we might term ‘general samples’ and would include inputs such as hand-written words. A hand-written word may be generally recognisable but is also particularly unique. It would not be desirable for the network to be trained for 100% accuracy to its training pairs, each of which is a ‘particular sample’, as this demand for 100% accuracy would degrade the networks capability to recognise patterns that are categorically similar but not exact.

One method for overcoming this issue is to use a second set of known input + expected output pairs that are not assigned for training but are assigned for network performance validation; as such these pairs are termed the validation set. A validation set would contain general samples (i.e. noisy or distorted samples) of the patterns to be recognised. When using a validation set, the network is trained in the same manner (using the training pairs only) and then performance is measured against the validation set separately. This allows the programmer to determine at what point the network has become optimally trained. after which further training would only degrade the capability of the network to identify general samples.

Using a validation set to determine the optimal training point

Figure 10: Using a validation set to determine the optimal training point

The programmer may have an overall performance target for the network (e.g., correctly identifying letters) alongside performance targets for each input (e.g., ‘A’, ‘B’ and ‘C’ individually). Without such layered targets it could be possible for a network to meet its overall performance criteria (e.g., correctly identifying letters 90% of the time) whilst performing very poorly on one individual performance criteria (e.g., the network might identify ‘A’, and ‘B’ perfectly but regularly fail on ‘C’). This could be overcome using a layered set of targets to govern when to stop training the network, e.g.:

Flow chart - Individual and total error

Figure 11: Flow chart - Individual and total error

Programmers may also consider the balance between performance and processing requirements, i.e. it may be acceptable to cede a marginal degree of performance if the cost to achieve that performance was significant in computation time. This could be a particularly pertinent consideration in applications where the network is expected to adapt quickly to new conditions.

Performance vs computer time

Figure 12: Performance vs computer time

Ultimately the choice of when to stop training the network will depend upon the application for the ANN and the requirements or constraints inherent to that functional environment.

Problems with the BP Algorithm

The use of the back propagation algorithm has several problems associated with it, the most well known being the problem of ‘Local Minima’. The point to note here is that the BP algorithm functions by modifying the network configuration in a stepwise manner (each step seeking to reduce error) until finding a configuration of weights resulting in minimal error, i.e.:

BP algorithm finding a point of minimal error

Figure 13: BP Algorithm finding a point of minimal error

However, in its most basic form the BP algorithm is incapable of finding an alternative configuration of even better performance (a true minimum), as it becomes ‘stuck’ in one of its local minima.

Illustration of the problem of Local Minima

Figure 14: Illustration of the problem of local minima

There are several potential solutions to this problem:

One solution may be to reset the network to an untrained condition with a different set of random initialising weights and to train the network several times. Comparing the minimum error following each period of training should highlight to the programmer configurations where the network is stuck in local minima.
Another solution is to add ‘momentum’ to the stepwise error minimising function (the BP Algorithm). In this solution, the training mechanism may be able to roll out of local minima and continue towards a true minimum. In practice, this is achieved by determining the change of weights based on the current training cycle AND several previous training cycles. i.e.:

$ω^{+} = ω + (c h a n g e)^{this cycle} + n (c h a n g e)^{previous cycle} . . . where N < 1$

Implementation of the BP algorithm using a momentum function

Figure 15: Implementation of the BP Algorithm using a momentum function

There are many other problems with the BP algorithm that have arisen over the years, normally dependent on specific applications and requirements. As such there are also many implementations of the BP algorithm each of which seeks to overcome some particular problem in the standard model.

Determining network size

There is no definitive method for determining the best size and structure of an ANN but there are several points to consider when building a network. The standard implementation is to use an input layer, one or perhaps two hidden layers and an output layer as illustrated below:

The multilayer neural network

Figure 16: The multilayer neural network

In this implementation:

the size of the input layer is determined by the structure of the input data. e.g. to process the letter ‘A’ (drawn on a 5 x 7 matrix) the input layer would require 35 input neurons (one for each pixel of the image).
the size of the output layer is determined by the number of required outputs and how the programmer decides to code each output. e.g. to classify ‘A’, ‘B’ and ‘C’ the programmer might choose three output neurons.
the size of the hidden layers is not determined so definitively. In fact, a range of sizes may be able to achieve the functional requirements of the network. In practice, the size of hidden layers is often determined by trial and error or by reviewing the structure of other functionally similar networks that have been successfully implemented.

For example, a basic letter recognition ANN with 35 inputs and 26 outputs (recognising 26 letters of the alphabet) struggles to train effectively with less than 6 neurons in the hidden layer and becomes inefficient with more than around 22 neurons in the hidden layer.

Summary

In summary:

In this section, we introduced the Multilayer Neural Network and reviewed how data is passed through the network in its forward pass.
We then reviewed several methods for teaching a Multilayer Neural Network to learn, considering the difference between supervised and unsupervised learning.
The back propagation algorithm was introduced as a form of training a Multilayer Neural Network in a supervised form.
Finally, several considerations were discussed for the practical implementation of building a basic ANN using back propagation.

Back Propagation Algorithm

Introduction

The multilayer neural network

Figure 1: The multilayer neural network

Worked example 1

Figure 2: ANN- Study exercise 1

Teaching a multilayer neural network to learn

Supervised learning vs. unsupervised learning

Figure 3: The basic supervised learning mechanism

Figure 4: An unsupervised learning mechanism

Example of supervised learning

Figure 5: A network with 4 inputs and 2 outputs

The back propagation algorithm

Worked example 2

Figure 6: Simple network - Worked example 1

Worked example 3

Figure 7: Simple network - Worked example 2

Considerations for training the BP algorithm

How to train the ANN to recognise multiple input patterns

Figure 8: Input patterns for A – C

Figure 9: Flow chart - Training multiple inputs

When to stop training

Figure 10: Using a validation set to determine the optimal training point

Figure 11: Flow chart - Individual and total error

Figure 12: Performance vs computer time

Problems with the BP Algorithm

Figure 13: BP Algorithm finding a point of minimal error

Figure 14: Illustration of the problem of local minima

Figure 15: Implementation of the BP Algorithm using a momentum function

Determining network size

Figure 16: The multilayer neural network

Summary